Choosing a Twice More Accurate Dot Product Implementation
Authors
Abstract
The fused multiply and add (FMA) operation computes a floating point multiplication followed by an addition or a subtraction as a single floating point operation. This means that only one final rounding error (to the working precision) is generated by an FMA, whereas two occur in the classical implementation of x × y + z. The Intel IA-64, IBM RS/6000 and PowerPC architectures implement this FMA operation. The aim of this talk is to study how the FMA improves the computation of the dot product with classical and compensated algorithms; the latter double the accuracy of the former at the same working precision. Six algorithms are considered. We present the associated theoretical error bounds, and numerical experiments illustrate the actual efficiency in terms of accuracy and running time. We show that the FMA does not improve the accuracy of the result in a significant way, whereas it significantly increases the actual speed of the algorithms.

On the Itanium processor, the FMA operation enables a multiplication and an addition to be performed in the same number of cycles as one multiplication or one addition [4]. The FMA operation therefore seems to be advantageous for both speed and accuracy. Indeed, it approximately halves the number of rounding errors in many numerical algorithms. This is the case, for example, for the dot product of two n-vectors, where just n rounding errors occur instead of 2n − 1 without FMA. Moreover, it is well known that the FMA yields an efficient computation of the rounding error generated by a floating point product. Such rounding error computation at the current working precision is a key task when implementing multi-precision libraries such as the double-double or quad-double ones [1], or when designing compensated algorithms. Compensated algorithms implement an inner computation of the rounding errors generated by the original algorithm and so provide more accurate results; [6, 5] are examples of compensated summation and dot product. The latter reference recently proved that these compensated implementations double the accuracy of the classical algorithm while still running at the current working precision.

Here we study how the FMA can improve the computation of dot products in terms of accuracy and running time. First, we consider the classical dot product computed at the working precision with or without FMA. We report the theoretical error analysis (worst case bounds) and some experimental results showing that the use of the FMA only slightly improves the accuracy of the computed dot product, even though the number of rounding errors is halved. Nevertheless, the accuracy provided by the classical dot product may not be sufficient for ill conditioned dot products; such cases appear, for instance, when computing residuals of ill conditioned linear systems. So we also consider accurate dot products whose computed result is as accurate as if computed in twice the working precision. Here we consider the classical dot product performed with double-double computation, as it can be found in the XBLAS library [3], and the compensated dot product from [5], where the FMA is used to compute the rounding error generated by each product.
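For concreteness, the following C sketch (our own illustration, not code from the talk) shows the two ingredients just described: a classical dot product written with FMAs, which generates n rounding errors instead of 2n − 1, and the well-known FMA-based computation of the exact rounding error of a product. The names two_prod_fma and dot_fma are our own labels; compile with a C99 libm (link with -lm) and without value-changing optimizations such as -ffast-math.

```c
#include <math.h>

/* Error-free transformation of a product using the FMA:
   p + e == a * b exactly (barring underflow/overflow),
   where p = fl(a * b). The rounding error e costs one FMA. */
static void two_prod_fma(double a, double b, double *p, double *e) {
    *p = a * b;              /* rounded product           */
    *e = fma(a, b, -*p);     /* exact rounding error of p */
}

/* Classical dot product with FMA: one rounding error per
   iteration, i.e. n in total instead of 2n - 1 without FMA. */
static double dot_fma(const double *x, const double *y, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = fma(x[i], y[i], s);
    return s;
}
```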
We also present a new compensated dot product using a recent algorithm by Boldo and Muller [2] that computes the exact result of an FMA operation as the unevaluated sum of three floating point values. We present theoretical error bounds proving that all these algorithms provide results as accurate as if computed in twice the working precision. We then compare these implementations in terms of practical computing time to identify the best choice for doubling the computing precision. Our experimental results show that the FMA does not significantly improve the accuracy of the computed result, whereas it significantly increases the actual speed of the algorithms.
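As a hedged illustration of the compensated approach, here is a sketch in the style of the compensated dot product Dot2 of [5]: each product error is recovered with one FMA, each addition error with Knuth's TwoSum, and the accumulated compensation is added back once at the end. This is our own reconstruction, not the authors' code, and it does not reproduce the more involved Boldo-Muller three-term variant [2].

```c
#include <math.h>

/* Knuth's TwoSum: s + e == a + b exactly, with s = fl(a + b). */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double z = *s - a;
    *e = (a - (*s - z)) + (b - z);
}

/* Compensated dot product in the style of Dot2 from [5].
   The result is as accurate as if the dot product had been
   computed in twice the working precision and then rounded. */
static double dot2_fma(const double *x, const double *y, int n) {
    double s = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double p  = x[i] * y[i];
        double ep = fma(x[i], y[i], -p);  /* product rounding error  */
        double t, es;
        two_sum(s, p, &t, &es);           /* addition rounding error */
        s = t;
        c += ep + es;                     /* compensation term       */
    }
    return s + c;                         /* single final correction */
}
```

As with the previous sketch, compile without -ffast-math so the compiler does not simplify the error-free transformations away.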
Similar resources
Accurate Sum and Dot Product
Algorithms for summation and dot product of floating point numbers are presented which are fast in terms of measured computing time. We show that the computed results are as accurate as if computed in twice or K-fold working precision, K ≥ 3. For twice the working precision our algorithms for summation and dot product are some 40 % faster than the corresponding XBLAS routines while sharing simi...
More Instruction Level Parallelism Explains the Actual Efficiency of Compensated Algorithms
The compensated Horner algorithm and the Horner algorithm with double-double arithmetic improve the accuracy of polynomial evaluation in IEEE-754 floating point arithmetic. Both yield a polynomial evaluation as accurate as if it was computed with the classic Horner algorithm in twice the working precision. Both algorithms also share the same low-level computation of the floating point rounding ...
Implementation of clinic-based modified-directly observed therapy (m-DOT) for ART; experiences in Mombasa, Kenya.
The effectiveness of modified-directly observed therapy (m-DOT), an adherence support intervention adapted from TB DOTS programmes, has been documented. Describing the implementation process and acceptability of this intervention is important for scaling up, replication in other settings and future research. In a randomised trial in Mombasa, Kenya, patients were assigned to m-DOT or standard of...
Simple and Accurate Detection of Vibrio Cholera Using Triplex Dot Blotting Assay
Cholera outbreak is more common in developing countries. The causative agent of the disease is Vibrio cholerae strains O1 and O139. Traditional diagnostic testing for Vibrio is not always reliable, because Vibrio can enter a viable but non cultivable state. Therefore, nucleic acid-based tests have emerged as a useful alternative to traditional enrichment testing. In this investigation, a...
Accurate Floating Point Arithmetic through Hardware Error-Free Transformations
This paper presents a hardware approach to performing accurate floating point addition and multiplication using the idea of error-free transformations. Specialized iterative algorithms are implemented for computing arbitrarily accurate sums and dot products. The results of a Xilinx Virtex 6 implementation are given; area and performance are compared against standard floating point units and it i...